
[HUDI-8631] Support of hoodie.populate.meta.fields for Flink append mode #12516

Merged: 2 commits, Dec 21, 2024

Conversation

geserdugarov (Contributor) commented Dec 18, 2024

Change Logs

Currently, hoodie.populate.meta.fields is not supported in Flink. This config is meant for append mode, so this MR adds support for that case.

Before (master, 3cb874f):
[profiler screenshot: 1-append-master-3cb874fd]

After:
[profiler screenshot: 2-append-populate-false]

|                                | Before | After |
| ------------------------------ | ------ | ----- |
| CPU samples                    | 79.8K  | 67.9K |
| Total time of writing 60M rows | 108 s  | 93 s  |

TPC-H lineitem table is used for profiling.

14% faster write in append mode for Flink when hoodie.populate.meta.fields = false.
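The 14% figure follows directly from the timing row in the table above; as a quick sanity check:

```java
// Quick arithmetic behind the "14% faster" claim: 108 s -> 93 s for 60M rows.
public class SpeedupCheck {
    static double speedupPercent(double beforeSeconds, double afterSeconds) {
        return (beforeSeconds - afterSeconds) / beforeSeconds * 100.0;
    }

    public static void main(String[] args) {
        // (108 - 93) / 108 * 100 = 13.89%, i.e. ~14%
        System.out.printf("%.1f%%%n", speedupPercent(108, 93));
    }
}
```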

The next possible optimization for this scenario is to reconsider whether we really need Bloom filters in this use case, because they cost 16% of CPU after the optimization in this MR:
[profiler screenshot: 3-append-bloom-filters]

Impact

No

Risk level (write none, low medium or high below)

Low

Documentation Update

No need. The "All Configurations" page does not mention that hoodie.populate.meta.fields is unsupported for Flink.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@@ -90,6 +91,7 @@ public HoodieRowDataCreateHandle(HoodieTable table, HoodieWriteConfig writeConfi
this.fileId = fileId;
this.newRecordLocation = new HoodieRecordLocation(instantTime, fileId);
this.preserveHoodieMetadata = preserveHoodieMetadata;
this.skipMetadataWrite = skipMetadataWrite;
Contributor:

Doesn't the flag preserveHoodieMetadata already control this behavior? BTW, there is another PR raised by @usberkeley fixing all these scenarios: #12404

geserdugarov (Contributor Author) commented Dec 19, 2024:

Yes, I saw #12404 and was confused by the ticket name, which mentions the Flink table config hoodie.populate.meta.fields, but I didn't find any changes in hudi-flink-client or hudi-flink-datasource that would change the current behavior. I described my point of view in #12404 (comment).
To support my comment, I've created this MR, which demonstrates the lack of support for hoodie.populate.meta.fields in Flink.

geserdugarov (Contributor Author) commented Dec 19, 2024:

preserveHoodieMetadata is actually used in the code as an indicator of whether to take metadata from the row data or to generate it by calling the corresponding methods, so the naming is a bit confusing. I believe the possible values of preserveHoodieMetadata can be described by this schema:
[schema image: preserveHoodieMetadata]
It looks like preserveHoodieMetadata can be true only for the clustering operator.

Contributor:

It looks like the flag preserveHoodieMetadata indicates whether the source row already includes the metadata fields. For a table service like clustering, this should be true by default (because clustering is just a rewrite). For a regular write, the metadata fields should be generated on the fly.

Let's check in which cases the option hoodie.populate.meta.fields could be false.
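The interplay of the two flags discussed above can be summarized in a small decision sketch. This is illustrative only: the field names mirror the diff, but the logic is a simplification, not Hudi's actual implementation.

```java
// Illustrative sketch of how the two flags could select a meta-field strategy:
//   skipMetadataWrite      -> hoodie.populate.meta.fields = false (append mode)
//   preserveHoodieMetadata -> source row already carries meta fields (clustering rewrite)
public class MetaFieldStrategy {
    static String strategy(boolean skipMetadataWrite, boolean preserveHoodieMetadata) {
        if (skipMetadataWrite) {
            return "NONE"; // write no _hoodie_* columns at all
        }
        // clustering rewrites rows that already contain meta fields;
        // regular writes generate them on the fly
        return preserveHoodieMetadata ? "COPY_FROM_ROW" : "GENERATE";
    }

    public static void main(String[] args) {
        System.out.println(strategy(true, false));  // NONE
        System.out.println(strategy(false, true));  // COPY_FROM_ROW
        System.out.println(strategy(false, false)); // GENERATE
    }
}
```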

geserdugarov (Contributor Author) commented Dec 19, 2024:

The user sets the value of the hoodie.populate.meta.fields option, which is true by default. Its description mentions "append only/immutable data" as the use case:

public static final ConfigProperty<Boolean> POPULATE_META_FIELDS = ConfigProperty
.key("hoodie.populate.meta.fields")
.defaultValue(true)
.withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
+ "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");

For this reason, this MR supports hoodie.populate.meta.fields in Flink only for append mode.

For a quick check, I use SQL queries like the following, which exercise append mode:

CREATE TABLE hudi_debug (
    id INT,
    part INT,
    desc STRING,
    PRIMARY KEY (id) NOT ENFORCED
) 
WITH (
    'connector' = 'hudi',
    'path' = '...',
    'table.type' = 'COPY_ON_WRITE',
    'write.operation' = 'insert',
    'hoodie.populate.meta.fields' = 'false'
);
INSERT INTO hudi_debug VALUES 
    (1,100,'aaa'),
    (2,200,'bbb');

Expected results: no exceptions are thrown during

SELECT * FROM hudi_debug;

and the corresponding parquet files in HDFS don't contain metadata columns.
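The "no metadata columns" expectation can be expressed as a small helper. The five `_hoodie_*` names are Hudi's standard meta columns; the helper itself is just an illustrative sketch, not code from this PR:

```java
import java.util.Arrays;
import java.util.List;

public class MetaColumnCheck {
    // Hudi's five standard meta columns; these should be absent from the
    // parquet schema when hoodie.populate.meta.fields = false.
    static final List<String> META_COLS = Arrays.asList(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    static boolean hasMetaColumns(List<String> parquetFields) {
        return parquetFields.stream().anyMatch(META_COLS::contains);
    }

    public static void main(String[] args) {
        // Schema of the files written by the append-mode example above
        System.out.println(hasMetaColumns(Arrays.asList("id", "part", "desc"))); // false
    }
}
```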

geserdugarov (Contributor Author) commented Dec 19, 2024:

I've also found that we can write a MOR table in upsert mode without metadata; the call stack in this case includes HoodieAppendHandle. But we can't read the resulting MOR table with Flink later, due to an exception thrown during:

SELECT * FROM hudi_debug;

I've created a separate bug for reading MOR tables without meta columns, HUDI-8785, and will fix it in a separate MR.

Contributor:

Can we at least add an integration test in ITTestHoodieDataSource?

geserdugarov (Contributor Author):

@danny0405
I've added ITTestHoodieDataSource::testWriteWithoutMetaColumns. But for proper checking, it would be great to write the data with Flink and then read it with Spark, because in Spark

SELECT * FROM table;

will return all columns, including those with metadata.
That would also be a really useful check of engine interoperability; I've created a corresponding task, HUDI-8788.

geserdugarov (Contributor Author):

I will restart Azure CI, but we have a problem with a timeout:
https://dev.azure.com/apachehudi/hudi-oss-ci/_build/results?buildId=2536&view=results
All jobs are successful, but CI failed, I suppose due to the running time of "UT spark-datasource Java Tests & DDL": 1 h 29 min 56 s, which is close to the 1 h 30 min timeout. Could we increase the timeout a little bit?

geserdugarov (Contributor Author):

@hudi-bot run azure

hudi-bot:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

danny0405 merged commit ae820af into apache:master on Dec 21, 2024 (42 of 43 checks passed).
geserdugarov deleted the master-flink-populate-meta branch on December 23, 2024.
Labels: size:S (PR with lines of changes in (10, 100])
3 participants